White wine exploration by Olivier Vernin

Univariate plot section

library(ggplot2)
library(gridExtra)
library(GGally)
library(scales)

wqw <- read.csv("/Users/olivier/Desktop/Udacity/rstudio/assignment/wineQualityWhites.csv")

Number of rows

nrow(wqw)
## [1] 4898

Variables

names(wqw)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
summary(wqw)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

As I am very new to wine characteristics, I did some research on the variable names to understand their impact on wine taste.

X is the anonymized unique ID of each wine, so let's convert it to a factor.

wqw$X <- as.factor(wqw$X)

Quality

As our task is to identify the chemical properties that influence quality, let's look at quality first.

ggplot(aes(x=quality), data=wqw) + geom_histogram()

plot of chunk unnamed-chunk-6

The distribution is discrete, so let's set the binwidth to 1.

ggplot(aes(x=quality), data=wqw) + geom_histogram(binwidth=1)

plot of chunk unnamed-chunk-7

The distribution of quality looks roughly normal, with a peak at 6.

summary(wqw$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The minimum is 3, the maximum is 9, and the middle 50% of the values lie between 5 and 6.

table(wqw$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Quality values are very concentrated. Let's see what percentage of the sample each value represents.

prop.table(table(wqw$quality))
## 
##           3           4           5           6           7           8 
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869 
##           9 
## 0.001020825

That's around 45% of the wines with a score of 6, nearly half of the sample; 6 seems like a very average value. Scores 3, 4 and 5 together account for around 33%, and scores 7, 8 and 9 together account for around 22%. It seems that we could use those groups to categorize our wines.

wqw$quality.group <- cut(wqw$quality, labels=c('low', 'average', 'high'), breaks=c(0, 5, 6, 10))
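The group shares quoted above can be double-checked from the counts returned by table(wqw$quality). This is only a sketch: it rebuilds a quality vector from those counts instead of reading the CSV, and the names counts, quality, group and shares are mine, not part of the analysis.

```r
# Counts per quality score (3 to 9), copied from the table(wqw$quality) output.
counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Rebuild one quality value per wine, then apply the same cut() call.
quality <- rep(3:9, times = counts)
group <- cut(quality,
             breaks = c(0, 5, 6, 10),
             labels = c("low", "average", "high"))

# cut() uses right-closed intervals by default: (0,5], (5,6], (6,10].
shares <- round(100 * prop.table(table(group)), 1)
shares
# low 33.5, average 44.9, high 21.6
```

Because the intervals are closed on the right, a score of 5 lands in 'low' and a score of 6 in 'average', which is the grouping described in the text.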

Fixed Acidity (g/L)

Fixed acidity is given as tartaric acid in the data description. Tartaric acid is a specific molecule. However, the documentation I read on fixed acidity indicates that it is a class of acids, and that tartaric acid and citric acid are both part of it. [http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity]

ggplot(aes(x=fixed.acidity), data=wqw) + geom_histogram()

plot of chunk unnamed-chunk-12

For fixed acidity we have a roughly normal distribution with a few outliers around 12 and 14. Half of the wines lie between 6.3 and 7.3 (the first and third quartiles).

Let's try to use a better bin size.

table(wqw$fixed.acidity)
## 
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5 
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31 
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8 
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308 
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2 
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56 
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7 
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4 
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2 
##    8    2    3    1    2    2    1    1

The measurements have a precision of 0.1, apart from a handful of values recorded to 0.05 (6.15, 6.45 and 7.15).

ggplot(aes(x=fixed.acidity), data=wqw) +
  geom_histogram(binwidth=.1)

plot of chunk unnamed-chunk-14
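Reading the precision off a long table works, but it can also be estimated with a few lines of code. The helper below is a rough sketch, not something used in the original analysis: common_step is a hypothetical name, and it simply returns the most frequent gap between consecutive distinct values.

```r
# Most common spacing between consecutive distinct values of x.
# 'digits' rounds away floating-point noise in the differences.
common_step <- function(x, digits = 3) {
  steps <- round(diff(sort(unique(x))), digits)
  as.numeric(names(which.max(table(steps))))
}

# Made-up values mimicking the fixed acidity table (mostly 0.1 apart).
example <- c(6.0, 6.1, 6.15, 6.2, 6.3, 6.4)
common_step(example)
```

On the real data one would call common_step(wqw$fixed.acidity) and use the result as a first guess for binwidth.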

Volatile Acidity (g/L)

From online research, volatile acidity is the measure of the steam-distillable acids in the wine. Note that the US legal limit is 1.1 g/L. I assume that our data are in g/L. It is normally not detectable up to 3 g/L.

ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-15

The volatile acidity distribution approaches a normal distribution, with 75% of the wines at or below 0.32 g/L. We also see a tail toward higher values, with a few outliers around 0.9 g/L and a top value of 1.1 g/L, which is right on the US legal limit.

Let's find a finer-grained bin width.

table(wqw$volatile.acidity)
## 
##  0.08 0.085  0.09   0.1 0.105  0.11 0.115  0.12 0.125  0.13 0.135  0.14 
##     4     1     1     6     6    13     3    34     3    44     1    56 
## 0.145  0.15 0.155  0.16 0.165  0.17 0.175  0.18 0.185  0.19   0.2 0.205 
##     4    88     5   141     2   140     1   177     5   170   214     4 
##  0.21 0.215  0.22 0.225  0.23 0.235  0.24 0.245  0.25 0.255  0.26 0.265 
##   191     1   229     4   216     4   253     4   231    10   240     5 
##  0.27 0.275  0.28 0.285  0.29 0.295   0.3 0.305  0.31 0.315  0.32 0.325 
##   218     3   263     5   160     3   198     4   148     4   182     2 
##  0.33 0.335  0.34 0.345  0.35 0.355  0.36 0.365  0.37 0.375  0.38 0.385 
##   134     7   135     9    86     1   104     2    65     2    63     2 
##  0.39 0.395   0.4 0.405  0.41 0.415  0.42 0.425  0.43 0.435  0.44 0.445 
##    61     2    59     1    54     4    36     2    35     2    46     4 
##  0.45 0.455  0.46  0.47 0.475  0.48 0.485  0.49 0.495   0.5  0.51  0.52 
##    25     2    30    15     3    17     3    14     2    14    10    10 
##  0.53  0.54 0.545  0.55 0.555  0.56  0.57  0.58 0.585  0.59 0.595   0.6 
##     8    10     1    14     2     9     4     7     2     4     2     7 
##  0.61 0.615  0.62  0.63  0.64  0.65 0.655  0.66  0.67  0.68 0.685  0.69 
##     7     4     5     2     7     2     3     4     5     3     1     2 
## 0.695 0.705  0.71  0.73  0.74  0.75  0.76  0.78 0.785 0.815  0.85 0.905 
##     3     2     1     1     1     1     2     1     1     1     1     1 
##  0.91  0.93 0.965 1.005   1.1 
##     1     1     1     1     1

It seems that the precision is 0.005.

ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram(binwidth=.005) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-17

I have the impression that the measurements were not recorded with a uniform precision: we get many values at 0.0X precision and very few at 0.0X5 precision. I will adopt a 0.01 bin size to smooth the plot.

ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram(binwidth=.01) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-18
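To back up the impression that the 0.0X5 readings are rare, one could measure the share of values that do not sit on the 0.01 grid. This is a sketch under my own assumptions: is_half_step is a hypothetical helper, and the tolerance is only there to absorb floating-point error.

```r
# TRUE for values that are not a whole multiple of 'step'
# (for example 0.205 when step is 0.01).
is_half_step <- function(x, step = 0.01, tol = 1e-6) {
  scaled <- x / step
  abs(scaled - round(scaled)) > tol
}

# Made-up readings: two on the 0.01 grid, two in between.
readings <- c(0.20, 0.205, 0.21, 0.215)
is_half_step(readings)
# FALSE TRUE FALSE TRUE
```

With the data loaded, mean(is_half_step(wqw$volatile.acidity)) would give the fraction of off-grid readings.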

Citric Acid (g/L)

From my internet search, citric acid contributes to the fixed acidity. It is usually present at between 0 and 0.5 g/L in wine.

ggplot(aes(x=citric.acid), data=wqw) + geom_histogram()

plot of chunk unnamed-chunk-19

Citric acid seems to follow a normal distribution with a peak at 0.3 g/L. Again there are a few outliers, at about 1.25 g/L and 1.7 g/L.

Let's find a finer-grained bin size.

table(wqw$citric.acid)
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##   19    7    6    2   12    5    6   12    4   12   14    1   19   17   27 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   23   33   27   49   48   70   66  104   83  181  136  219  216  282  223 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##  307  200  257  183  225  137  177  134  122  101  117   82   95   37   63 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   46   51   38   39  215   35   25   23   16   19   11   22   13   21    6 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    6    9   14    4    6    8    7    7    7    5    3    9    5    5   41 
## 0.78 0.79  0.8 0.81 0.82 0.86 0.88 0.91 0.99    1 1.23 1.66 
##    2    2    2    2    2    1    1    2    1    5    1    1

The data precision is 0.01.

ggplot(aes(x=citric.acid), data=wqw) + 
  geom_histogram(binwidth=.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-21

table(subset(wqw, citric.acid == .49 | citric.acid == .74)$citric.acid)
## 
## 0.49 0.74 
##  215   41

The distribution seems normal except for a few outliers at 1.23 g/L and 1.66 g/L.

Note the two sharp peaks of concentration, at 0.49 g/L (215 wines) and 0.74 g/L (41 wines).

Instinctively, this looks like the result of a carefully controlled addition to the wine. Indeed, citric acid can be used to boost acidity and add “freshness” [http://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid]. But one shouldn't add too much, as it brings a strong citric flavor.

Let's create a categorical variable for those values of citric acid.

wqw$added.citric.acid <- ifelse(wqw$citric.acid == .49 | wqw$citric.acid ==.74, 'yes', 'no')
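Comparing doubles with == works here because the values come straight from the CSV, but it can silently fail once a value has been computed. A more defensive variant, shown purely as a sketch (near is a hypothetical helper and the tolerance is my choice), compares within a small tolerance instead.

```r
# TRUE when x is within 'tol' of 'target'.
near <- function(x, target, tol = 1e-8) {
  abs(x - target) < tol
}

# Made-up values; the last one is computed, so it may not be
# bit-identical to the literal 0.49.
citric <- c(0.30, 0.49, 0.74, 0.1 + 0.39)

added <- ifelse(near(citric, 0.49) | near(citric, 0.74), "yes", "no")
added
# "no" "yes" "yes" "yes"
```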

Residual Sugar (g/L)

Residual sugar is the sugar that was not converted during fermentation, in g/L.

ggplot(aes(x=residual.sugar), data=wqw) + geom_histogram()

plot of chunk unnamed-chunk-24

We have more of a long-tailed distribution, but the peak count of about 1500 is probably due to bins that are too wide. There are also a few outliers above 60 and around 30.

Let's look a bit more at the values.

table(wqw$residual.sugar)
## 
##   0.6   0.7   0.8   0.9  0.95     1  1.05   1.1  1.15   1.2  1.25   1.3 
##     2     7    25    39     4    93     1   146     3   187     3   147 
##  1.35   1.4  1.45   1.5  1.55   1.6  1.65   1.7  1.75   1.8  1.85   1.9 
##     2   184     4   142     2   165     2    99     1    99     3    59 
##  1.95     2  2.05   2.1   2.2  2.25   2.3  2.35   2.4   2.5   2.6  2.65 
##     2    79     1    51    56     2    42     1    41    40    33     1 
##   2.7   2.8  2.85   2.9     3   3.1  3.15   3.2   3.3   3.4   3.5   3.6 
##    38    36     1    25    17    17     1    28    23    13    31    22 
##   3.7  3.75   3.8  3.85   3.9  3.95     4   4.1   4.2  4.25   4.3  4.35 
##    12     2    21     3    17     3    19    17    31     2    19     1 
##   4.4  4.45   4.5  4.55   4.6   4.7  4.75   4.8  4.85   4.9     5   5.1 
##    14     3    33     2    40    29     5    38     1    35    43    28 
##  5.15   5.2  5.25   5.3  5.35   5.4  5.45   5.5  5.55   5.6   5.7   5.8 
##     2    29     4    17     2    23     2    13     1    16    30    23 
##  5.85   5.9  5.95     6   6.1   6.2   6.3  6.35   6.4   6.5  6.55   6.6 
##     2    19     1    23    21    31    39     1    34    26     1    30 
##  6.65   6.7  6.75   6.8  6.85   6.9  6.95     7  7.05   7.1   7.2  7.25 
##     3    25     1    28     6    20     1    31     2    36    29     2 
##   7.3  7.35   7.4  7.45   7.5   7.6   7.7  7.75   7.8  7.85   7.9  7.95 
##    19     2    40     1    30    29    34     2    41     1    32     1 
##     8   8.1  8.15   8.2  8.25   8.3   8.4  8.45   8.5  8.55   8.6  8.65 
##    32    34     1    36     2    31    13     1    24     1    27     1 
##   8.7  8.75   8.8   8.9  8.95     9  9.05   9.1  9.15   9.2  9.25   9.3 
##    18     2    22    23     1    18     1    17     2    22     2    11 
##   9.4   9.5  9.55   9.6  9.65   9.7   9.8  9.85   9.9    10 10.05  10.1 
##    10     9     1    18     4    22    16     3    18    18     3    14 
##  10.2  10.3  10.4  10.5 10.55  10.6 10.65  10.7  10.8  10.9    11  11.1 
##    23    16    25    16     1    22     1    26    17    11    19    18 
##  11.2 11.25  11.3  11.4 11.45  11.5  11.6  11.7 11.75  11.8  11.9 11.95 
##    18     2    12    14     1    11    15     8     4    35    16     3 
##    12 12.05  12.1 12.15  12.2  12.3  12.4  12.5 12.55  12.6  12.7 12.75 
##    16     1    21     4    15    13    19    16     2    16    16     1 
##  12.8 12.85  12.9    13  13.1 13.15  13.2  13.3  13.4  13.5 13.55  13.6 
##    25     4    25    19    23     1    13    16     7    10     3    12 
## 13.65  13.7  13.8  13.9    14 14.05  14.1 14.15  14.2  14.3 14.35  14.4 
##     4    21     8    18    16     1     4     1    20    17     3    17 
## 14.45  14.5 14.55  14.6  14.7 14.75  14.8  14.9 14.95    15  15.1 15.15 
##     3    17     3    13    14     2    12    14     2    13     7     1 
##  15.2 15.25  15.3  15.4  15.5 15.55  15.6  15.7 15.75  15.8  15.9    16 
##     6     1     9    17    11     6    14     9     1     6     2    10 
## 16.05  16.1  16.2  16.3  16.4 16.45  16.5 16.55  16.6 16.65  16.7 16.75 
##     6     2     7     7     5     1     3     1     2     5     5     2 
##  16.8 16.85  16.9 16.95    17 17.05  17.1  17.2  17.3 17.35  17.4 17.45 
##     4     4     3     3     1     1     5     9    14     1     2     2 
##  17.5 17.55  17.6  17.7 17.75  17.8 17.85  17.9 17.95    18 18.05  18.1 
##     8     3     2     1     4    13     5     2     3     2     3     6 
## 18.15  18.2  18.3 18.35  18.4  18.5  18.6 18.75  18.8  18.9 18.95  19.1 
##     8     3     2     4     1     1     1     4     3     1     3     1 
## 19.25  19.3 19.35  19.4 19.45  19.5  19.6  19.8  19.9 19.95 20.15  20.2 
##     3     4     1     2     3     2     1     4     1     3     1     2 
##  20.3  20.4  20.7  20.8    22  22.6  23.5 26.05  31.6  65.8 
##     1     1     2     2     2     1     1     2     2     1

No single value occurs more than 187 times, and 0.1 seems to be a suitable binwidth. Let's also drop the top 10% of values to remove the outliers.

ggplot(aes(x=residual.sugar), 
       data=subset(wqw, residual.sugar < quantile(residual.sugar, .9))) + 
  geom_histogram(binwidth=0.1) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-26

It's a more acceptable visualization.

Let's see if we can get a normal distribution by taking the square root of the residual sugar.

ggplot(aes(x=sqrt(residual.sugar)), data=wqw) + geom_histogram(binwidth=0.1) 

plot of chunk unnamed-chunk-27

Well not very convincing…

Let's try the log10 of the residual sugar.

ggplot(aes(x=log10(residual.sugar)), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-28

This looks a bit better: we get a bimodal distribution.
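One way to put a number on how much a transformation tames the long tail is the sample skewness. The snippet below is an illustration only: the skewness helper is written by hand (base R has none), and the sugar values are made up to mimic a long right tail, not taken from the dataset.

```r
# Moment-based sample skewness: 0 for a symmetric sample,
# positive for a long right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up residual sugar values with one extreme wine.
sugar <- c(1, 1.2, 1.5, 2, 5, 8, 14, 20, 65.8)

skewness(sugar)         # strongly right-skewed
skewness(log10(sugar))  # much closer to 0
```

Skewness says nothing about the two modes, so it complements the histogram and does not replace it.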

Wine Sweetness

As described on http://en.wikipedia.org/wiki/Sweetness_of_wine, wines are categorized by sweetness.

ggplot(aes(x=residual.sugar), 
       data=wqw) + 
  geom_histogram(binwidth=0.1) +
  scale_x_continuous(breaks=c(4, 12, 45))
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Error: arguments imply differing number of rows: 666, 655

It seems that dry wines form the largest group; let's create a factor variable.

wqw$sweetness <- cut(wqw$residual.sugar, labels=c('dry', 'medium dry', 'medium', 'sweet'), breaks=c(0, 4, 12, 45, 500))
prop.table(table(wqw$sweetness))
## 
##         dry  medium dry      medium       sweet 
## 0.428133932 0.403225806 0.168436096 0.000204165

Here is the proportion of each sweetness category in our sample: about 42.8% dry, 40.3% medium dry, 16.8% medium, and 0.02% sweet (a single wine).
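It is worth checking on which side of each threshold a borderline wine falls. The sketch below applies the same breaks and labels to a few made-up sugar values that sit on or next to the boundaries.

```r
# Made-up residual sugar values (g/L) on and around the thresholds.
sugar <- c(0.6, 4, 4.1, 12, 45, 65.8)

sweetness <- cut(sugar,
                 breaks = c(0, 4, 12, 45, 500),
                 labels = c("dry", "medium dry", "medium", "sweet"))

data.frame(sugar, sweetness)
# 0.6 dry, 4 dry, 4.1 medium dry, 12 medium dry, 45 medium, 65.8 sweet
```

Since cut() closes intervals on the right by default, a wine with exactly 4 g/L counts as dry; passing right = FALSE would move it to medium dry.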

Chlorides (g/L)

The amount of salt in the wine in g/L.

ggplot(aes(x=chlorides), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-32

table(wqw$chlorides)
## 
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 0.021 0.022 
##     1     1     1     4     4     5     5    10     9    16    19    19 
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 0.033 0.034 
##    20    34    30    54    58    85    81   108   107   109   119   168 
## 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 0.045 0.046 
##   130   200   160   167   157   182   147   184   141   201   170   181 
## 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 
##   171   174   133   170   115   104   130    99    61    88    68    53 
## 0.059  0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 
##    36    46    19    25    23    15     8    18    18     7    18     6 
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 
##     5     2     5     8     2     9     1     2     4     4     2     2 
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 
##     5     5     3     4     3     2     1     2     1     3     3     5 
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108  0.11 0.112 0.114 
##     2     6     1     3     1     1     1     1     2     3     1     1 
## 0.115 0.117 0.118 0.119  0.12 0.121 0.122 0.123 0.126 0.127  0.13 0.132 
##     1     3     1     3     1     2     1     4     3     2     1     1 
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149 
##     1     1     1     2     2     3     1     1     1     2     1     1 
##  0.15 0.152 0.154 0.156 0.157 0.158  0.16 0.167 0.168 0.169  0.17 0.171 
##     1     2     1     1     4     1     2     2     3     2     2     1 
## 0.172 0.173 0.174 0.175 0.176 0.179  0.18 0.184 0.185 0.186 0.194 0.197 
##     2     2     2     2     2     1     1     2     2     1     1     2 
##   0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239  0.24 0.244 0.255 
##     1     2     1     2     1     1     1     1     1     1     1     1 
## 0.271  0.29 0.301 0.346 
##     1     1     1     1

It seems that 0.001 would be an appropriate bin width. Let's also remove the top 1% of values.

ggplot(aes(x=chlorides), 
       data=subset(wqw, chlorides < quantile(chlorides,.99))) + 
  geom_histogram(binwidth=.001) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-34

The graph now follows a roughly normal distribution between 0.009 and 0.069. However, we still have a long tail from 0.08 up to 0.16.

Let's try without the 5% highest values.

ggplot(aes(x=chlorides), 
       data=subset(wqw, chlorides < quantile(chlorides,.95))) + 
  geom_histogram(binwidth=.001) 

plot of chunk unnamed-chunk-35
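The quantile filter used in these plots can be tried on a small vector to see exactly what it removes. This is a toy sketch with made-up numbers, not the chlorides data.

```r
# 99 ordinary values plus one extreme outlier.
x <- c(1:99, 1000)

# Same pattern as subset(wqw, chlorides < quantile(chlorides, .99)).
cutoff <- quantile(x, 0.99)
kept <- x[x < cutoff]

length(kept)  # 99
max(kept)     # 99
```

The comparison is strict, so any value equal to the cutoff is dropped too. With heavily repeated values, as in this dataset, the filter can therefore remove somewhat more than the nominal 1% or 5%.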

Free Sulfur Dioxide (mg/L)

Free sulfur dioxide represents the free SO2 molecules, in mg/dm3, which work as a preservative. This molecule is easily detectable above 50 ppm.

ggplot(aes(x=free.sulfur.dioxide), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-36

We have at least one outlier at around 300, and the bins are too wide, with a peak count of about 1250.

table(wqw$free.sulfur.dioxide)
## 
##     2     3     4     5     6     7     8     9    10    11  11.5    12 
##     1    10    11    25    32    25    35    29    55    45     1    51 
##    13    14    15  15.5    16    17    18    19  19.5    20    21    22 
##    55    68    79     1    58    89    80    84     1   101    93   102 
##    23  23.5    24    25    26    27    28  28.5    29    30  30.5    31 
##   110     1   118   111   129    99   112     1   160    99     1   132 
##    32    33    34    35  35.5    36    37    38  38.5    39  39.5    40 
##   109   112   128   129     2   127   111   102     1    89     1   103 
##  40.5    41  41.5    42  42.5    43  43.5    44  44.5    45    46    47 
##     1   104     2    86     1    63     1    75     4   101    64    91 
##    48  48.5    49    50  50.5    51  51.5    52  52.5    53    54    55 
##    66     7    82    64     2    54     1    72     4    68    61    58 
##    56    57    58    59  59.5    60  60.5    61  61.5    62    63    64 
##    42    44    37    39     2    38     2    47     1    29    30    23 
##  64.5    65    66    67    68    69    70  70.5    71    72    73  73.5 
##     1    14    17    22    24    17    11     1     5     6     8     4 
##    74    75    76    77  77.5    78    79  79.5    80    81    82  82.5 
##     5     7     5     5     1     4     2     4     1     7     2     1 
##    83    85    86    87    88    89    93    95    96    97    98   101 
##     4     2     2     4     1     1     1     1     3     1     3     2 
##   105   108   110   112 118.5 122.5   124   128   131 138.5 146.5   289 
##     2     3     1     1     1     1     1     1     1     1     1     1

It seems that a bin size of 1 would work.

ggplot(aes(x=free.sulfur.dioxide), 
       data=subset(wqw, free.sulfur.dioxide < quantile(free.sulfur.dioxide, .99))) + 
  geom_histogram(binwidth=1) 

plot of chunk unnamed-chunk-38

The distribution has a rather flattened normal shape.

Total Sulfur Dioxide (mg/L)

The total amount of SO2 in mg/dm3. It includes the free sulfur dioxide.

ggplot(aes(x=total.sulfur.dioxide), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-39

Again, a few outliers and what seems like a normal distribution. Let's adjust the bin size and remove the outliers.

summary(wqw$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
table(wqw$total.sulfur.dioxide)
## 
##     9    10    18    19    21    24    25    26    28    29    30    31 
##     1     1     2     1     1     3     1     1     4     2     2     1 
##    33    34    37    40    41    44    45    46    47    48    49    50 
##     1     2     3     3     4     1     2     2     3     1     4     3 
##    51    53    54    55    56    57    58    59    60    61    62    63 
##     3     2     2     7     5     7     2     5     6     9     2    10 
##    64    65    66    67    68    69    70    71    72    73    74    75 
##     6     8     7    12    14    10     8    12    17    20    12    14 
##    76    77    78    79    80    81    82    83    84    85    86    87 
##    26    14    17    15    23    21    17    17    27    20    25    39 
##    88    89    90    91    92    93    94    95    96    97    98    99 
##    15    23    30    22    30    42    28    34    28    41    49    34 
##   100   101   102   103   104   105   106   107   108   109   110   111 
##    37    47    37    34    44    41    32    45    32    37    47    69 
##   112   113   114   115 115.5   116   117   118   119   120   121   122 
##    31    61    54    45     1    47    57    55    47    42    37    54 
##   123   124   125   126   127   128   129 129.5   130   131   132   133 
##    33    53    49    50    38    54    32     2    46    47    47    50 
##   134   135   136   137   138   139   140   141   142   143   144   145 
##    47    41    38    27    45    28    52    29    46    44    35    30 
##   146   147   148   149   150   151   152   153   154   155   156   157 
##    31    31    44    48    54    39    43    32    27    39    47    31 
##   158   159   160   161   162 162.5   163   164 164.5   165   166   167 
##    38    34    32    37    34     2    36    27     1    19    39    32 
##   168   169   170   171   172   173   174   175   176 176.5   177   178 
##    43    29    32    27    28    32    28    16    24     1    27    41 
##   179   180   181   182   183   184   185   186   187   188   189 189.5 
##    26    34    21    30    35    30    18    25    19    23    30     3 
##   190   191   192   193   194   195   196   197   198   199   200   201 
##    17    28    18    15    21    17    16    28    18    10    18    16 
##   202   203   204   205   206   207   208   209   210   211   212 212.5 
##    13     7    13    12    14    10    10    11    23     8    15     6 
##   213   214   215   216 216.5   217 217.5   218 218.5   219 219.5   220 
##    14    10    10     8     1     4     1     4     3     6     1     7 
##   221   222   223   224   225   226   227   228   229   230   231   232 
##    13     7     9     9     4     3     8     8     9     6     5     1 
##   233   234 234.5   235   236   237   238 238.5   240   241   242   243 
##     2     7     1     2     3     3     5     1     7     2     2     6 
##   244   245   246   247   248   249 249.5   251   252   253   255   256 
##     2     5     1     3     3     2     1     4     2     3     1     2 
##   259   260   272   282   294   303 307.5   313   344 366.5   440 
##     1     1     2     1     1     1     1     1     1     1     1
ggplot(aes(x=total.sulfur.dioxide), 
       data=subset(wqw, total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) + 
  geom_histogram(binwidth=1) 

plot of chunk unnamed-chunk-42

The distribution has a lot of noise; wider bins might attenuate it.

ggplot(aes(x=total.sulfur.dioxide), 
       data=subset(wqw, total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) +
  geom_histogram(binwidth=3)

plot of chunk unnamed-chunk-43

Additional note on Total Sulfur Dioxide and the “Contains sulfites” indication

One often sees “contains sulfites” on wine bottles because a small part of the population (less than 1%) is sulfite-sensitive. The label is required when the concentration is higher than 10 ppm. In the US the maximum authorized is 350 ppm. Sulfite content is also used as a criterion for organic wine, with a maximum of 100 ppm. [http://waterhouse.ucdavis.edu/whats-in-wine/sulfites-in-wine]

For liquids, 1 mg/L is approximately 1 ppm, so we can represent those thresholds on the graph.

ggplot(aes(x=total.sulfur.dioxide), 
       data=wqw) +
  geom_histogram(binwidth=3) +
  scale_x_continuous(breaks=c(10,100,350))

plot of chunk unnamed-chunk-44
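The 1 mg/L ≈ 1 ppm shortcut assumes a liquid density of 1 g/cm3. As a rough check, assuming ppm here means mg of SO2 per kg of wine, the conversion can be written with the density as a parameter. The function name mg_per_l_to_ppm is hypothetical; the two densities are the minimum and maximum from summary(wqw).

```r
# Convert a concentration in mg/L to ppm by mass (mg/kg),
# given the liquid density in g/cm3 (numerically equal to kg/L).
mg_per_l_to_ppm <- function(mg_per_l, density_g_cm3 = 1) {
  mg_per_l / density_g_cm3
}

# The 350 mg/L threshold at the lightest and densest wine in the sample.
mg_per_l_to_ppm(350, 0.9871)
mg_per_l_to_ppm(350, 1.0390)
```

Under that assumption both results stay within a few percent of 350, so treating mg/L as ppm looks reasonable for this exploration.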

It seems that nearly all our white wines would have to display “Contains Sulfites”; only the two wines at 9 and 10 mg/L are not above the 10 ppm threshold. Still, a portion of them could be considered organic. Two wines in our sample would not be authorized in the US.

Apparently this 10 ppm threshold is more a health issue than anything to do with wine quality, but let's still create a new variable contains.sulfites with five groups: up to 1, 1 to 10, 10 to 100, 100 to 350, and more than 350.

wqw$contains.sulfites <- cut(wqw$total.sulfur.dioxide, labels=c('no', 'negligable', 'low', 'normal', 'high'), breaks=c(0, 1, 10, 100, 350, 800))
prop.table(table(wqw$contains.sulfites))
## 
##           no   negligable          low       normal         high 
## 0.0000000000 0.0004083299 0.1880359330 0.8111474071 0.0004083299

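Our sample contains no wine in the 'no' group, about 0.04% in 'negligable' (2 wines), 18.8% in 'low', 81.1% in 'normal' and 0.04% in 'high' (2 wines).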

Ratio Free Sulfur Dioxide and Total Sulfur Dioxide

According to the practical winemaker journal [http://www.practicalwinery.com/janfeb09/page2.htm], the ratio between free SO2 and total SO2 is key for the preservation of the wine. So let's explore this ratio.

ggplot(aes(x=free.sulfur.dioxide/total.sulfur.dioxide), data=wqw) +
  geom_histogram(binwidth=0.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-47

We get a normal distribution of the ratio. Most of the values are between 10% and 40%.

The article also mentions that for dry table wines the level of free sulfur is usually somewhere around 40% to 75% of the level of total SO2. Let's cross-check with our sample.

ggplot(aes(x=free.sulfur.dioxide/total.sulfur.dioxide), 
       data=subset(wqw, sweetness == 'dry')) +
  geom_histogram(binwidth=0.01) +
  scale_x_continuous(breaks=c(.4,.75))
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-48

Very few of our dry wines fall within the 40% to 75% range. Most of them are below 40%. After reading the article multiple times and double-checking my variables, I cannot figure out why our sample ratio is so different.

As this ratio seems important for wine conservation, let's add it as a variable, keeping in mind that we couldn't really validate our values.

wqw$ratio.sulfur.dioxide <- wqw$free.sulfur.dioxide/wqw$total.sulfur.dioxide
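Since free SO2 is part of total SO2, the ratio should always fall between 0 and 1, which makes for a cheap sanity check. The sketch below uses made-up values rather than the dataset, and also computes the share of wines inside the 40% to 75% band mentioned by the article.

```r
# Made-up free and total SO2 values (mg/L).
free  <- c(10, 30, 45, 60)
total <- c(100, 100, 100, 100)

ratio <- free / total

# Free SO2 can never exceed total SO2, so the ratio must stay in [0, 1].
stopifnot(all(ratio >= 0 & ratio <= 1))

# Share of wines whose ratio lies in the 40% to 75% band.
in_band <- mean(ratio >= 0.40 & ratio <= 0.75)
in_band
# 0.5
```

On the real data the same two checks would be run on wqw$ratio.sulfur.dioxide, possibly restricted to sweetness == 'dry'.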

Density (g/cm3)

ggplot(aes(x=density), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-50

Let's look at the summary and the table to choose an appropriate binwidth.

summary(wqw$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
table(wqw$density)
## 
##  0.98711  0.98713  0.98722   0.9874  0.98742  0.98746  0.98758  0.98774 
##        1        1        1        1        2        2        1        1 
##  0.98779  0.98794  0.98802  0.98815  0.98816  0.98819  0.98822  0.98823 
##        1        2        1        1        1        1        1        1 
## 0.988245  0.98834  0.98836   0.9884  0.98845  0.98853  0.98854  0.98856 
##        1        1        2        1        1        1        1        2 
##   0.9886  0.98862  0.98865  0.98867  0.98868  0.98869   0.9887  0.98871 
##        2        3        3        2        1        1        2        2 
##  0.98872  0.98876  0.98878   0.9888  0.98882  0.98883  0.98884  0.98886 
##        2        2        1        1        1        1        1        2 
##  0.98889   0.9889  0.98892  0.98894  0.98895  0.98896  0.98898    0.989 
##        3        5        5        3        2        2        1        4 
##  0.98902  0.98904  0.98906   0.9891  0.98912  0.98913  0.98914  0.98915 
##        1        1        2        2        5        1        5        1 
##  0.98916  0.98918  0.98919   0.9892  0.98922  0.98923  0.98924  0.98926 
##        4        4        1        7        1        1        3        6 
##  0.98928   0.9893  0.98931 0.989315  0.98934  0.98935  0.98936  0.98938 
##        2        8        2        1        5        1        5        2 
##  0.98939   0.9894  0.98941  0.98942 0.989435  0.98944  0.98945  0.98946 
##        2        6        1        5        1        7        2        7 
## 0.989465  0.98947  0.98948  0.98949   0.9895  0.98951  0.98952  0.98953 
##        1        1        2        4        6        1        7        3 
##  0.98954  0.98956  0.98958  0.98959   0.9896  0.98961  0.98962  0.98963 
##        3        5        4        3        9        5        2        6 
##  0.98964  0.98965  0.98966  0.98968   0.9897  0.98972  0.98974  0.98975 
##        9        1        3        4        6        4        3        3 
##  0.98976  0.98978   0.9898  0.98981  0.98982  0.98984  0.98985  0.98986 
##        2        1       18        3        1        6        2        3 
##  0.98987  0.98988   0.9899  0.98992  0.98993  0.98994  0.98995  0.98997 
##        2        4        8        1        4        3        1        1 
##  0.98998  0.98999     0.99  0.99001  0.99002  0.99004  0.99005  0.99006 
##        4        3       28        2        4        4        1        3 
##  0.99007  0.99008  0.99009   0.9901  0.99011  0.99012  0.99013  0.99014 
##        1        4        1        8        1        3        1        5 
##  0.99015  0.99016  0.99018  0.99019   0.9902  0.99021  0.99022  0.99024 
##        1        4        5        1       20        6        5        3 
##  0.99026  0.99027  0.99028   0.9903  0.99031  0.99032  0.99033  0.99034 
##       10        1        3       13        4        3        3        1 
##  0.99035  0.99036  0.99037  0.99038   0.9904  0.99041  0.99042  0.99043 
##        7        8        1        3       13        1        2        5 
##  0.99044  0.99045  0.99046  0.99047  0.99048   0.9905  0.99051  0.99052 
##        7        4        2        3        4        9        1        4 
##  0.99053  0.99054  0.99055  0.99056  0.99057  0.99058  0.99059   0.9906 
##        2        2        1        3        3        9        2       32 
##  0.99061  0.99062  0.99063  0.99064  0.99065  0.99066  0.99067  0.99068 
##        1        5        1        4        1        7        5        3 
##  0.99069   0.9907  0.99071  0.99072  0.99074  0.99075  0.99076  0.99077 
##        2       13        1        3        5        2       15        1 
##  0.99078  0.99079   0.9908  0.99081  0.99082  0.99084  0.99085  0.99086 
##        2        1       25        1        4        8        6        3 
##  0.99088  0.99089   0.9909  0.99091  0.99092  0.99093  0.99094  0.99095 
##        5        7       13        2        4        2        5        3 
##  0.99096  0.99097  0.99098  0.99099    0.991  0.99102  0.99103  0.99104 
##        4        2        4        2       34        1        1        4 
##  0.99105  0.99106  0.99107  0.99108  0.99109   0.9911  0.99111  0.99112 
##        2        2        1        3        2       25        5        6 
##  0.99114  0.99115  0.99116  0.99117  0.99118  0.99119   0.9912  0.99121 
##        8        1        5        1        2        3       33        3 
##  0.99122  0.99123  0.99124  0.99125  0.99126  0.99127  0.99128  0.99129 
##        3        4        3        3        8        1        4        3 
##   0.9913  0.99132  0.99133  0.99134  0.99135  0.99136  0.99137  0.99138 
##       16        6        2        5        2        3        1        9 
##  0.99139   0.9914  0.99142  0.99143  0.99144  0.99146  0.99148   0.9915 
##        2       39        7        3        8        7        4       10 
##  0.99151  0.99152  0.99153  0.99154  0.99155  0.99156  0.99157  0.99158 
##        4        5        3        5        3        3        1        6 
##  0.99159   0.9916  0.99161  0.99162  0.99163  0.99164  0.99165  0.99166 
##        4       23        2        4        1       11        6        7 
##  0.99167  0.99168   0.9917  0.99171  0.99172  0.99173  0.99174  0.99175 
##        1        7       34        1        5        4        7        3 
##  0.99176  0.99177  0.99178  0.99179   0.9918  0.99182  0.99183  0.99184 
##       11        2        8        1       40        6        1       12 
##  0.99185  0.99186  0.99188  0.99189   0.9919  0.99192  0.99193  0.99194 
##        5        6        8        4       14        5        2        4 
##  0.99195  0.99196  0.99198  0.99199    0.992  0.99201  0.99202  0.99203 
##        2        4        6        2       64        1        5        1 
##  0.99204  0.99205  0.99206  0.99207  0.99208  0.99209   0.9921  0.99211 
##        4        1        4        4        3        1       16        1 
##  0.99212  0.99214  0.99215  0.99216  0.99218   0.9922  0.99221  0.99222 
##       13        3        6        8        4       27        2        3 
##  0.99223  0.99224  0.99225  0.99226  0.99228  0.99229   0.9923  0.99232 
##        1        6        3        9        5        1       19        4 
##  0.99234  0.99235  0.99236  0.99237  0.99238  0.99239   0.9924  0.99241 
##        6        4        1        2        7        2       44        2 
##  0.99242  0.99243  0.99244  0.99245  0.99246  0.99248  0.99249   0.9925 
##        3        3        8        2        3        3        2       22 
##  0.99251  0.99252  0.99253  0.99254  0.99255  0.99256  0.99257  0.99258 
##        1        3        1        5        3        5        2        1 
##   0.9926  0.99261  0.99262  0.99264  0.99265  0.99266  0.99267  0.99268 
##       32        1        2        1        2        5        1        6 
##  0.99269   0.9927  0.99271  0.99272  0.99273  0.99274  0.99275  0.99276 
##        2       47        3        5        2        4        1        3 
##  0.99278  0.99279   0.9928  0.99281  0.99282  0.99283  0.99284  0.99286 
##        8        1       61        1        3        2        1        4 
##  0.99287  0.99288  0.99289   0.9929  0.99293  0.99294  0.99295  0.99296 
##        3        5        1       20        3        1        1        7 
##  0.99297  0.99298  0.99299    0.993  0.99302  0.99304  0.99305  0.99306 
##        4        1        4       52        1       10        4        5 
##  0.99307  0.99308  0.99309   0.9931  0.99311  0.99312  0.99313  0.99314 
##        3        3        1       28        1        4        3        6 
##  0.99315  0.99316  0.99317  0.99318  0.99319   0.9932  0.99321  0.99322 
##        3        4        1        4        1       53        3        3 
##  0.99323  0.99324  0.99325  0.99326  0.99328  0.99329   0.9933  0.99331 
##        1        6        1        7        3        1       17        2 
##  0.99332  0.99334  0.99335  0.99336  0.99338  0.99339   0.9934  0.99341 
##        4        5        5        3        6        2       50        1 
##  0.99342  0.99344  0.99345  0.99346  0.99347  0.99348   0.9935  0.99352 
##        1        4        2        2        4        3       17        5 
##  0.99353  0.99354  0.99355  0.99356  0.99358   0.9936  0.99361  0.99362 
##        1        4        1        3        3       34        1       14 
##  0.99364  0.99365  0.99366  0.99367  0.99368   0.9937  0.99372  0.99373 
##        4        3        4        1        5       35        1        5 
##  0.99374  0.99375  0.99376  0.99378  0.99379   0.9938  0.99381  0.99382 
##        2        2        1        3        1       49        1        8 
##  0.99383  0.99384  0.99385  0.99386  0.99388   0.9939  0.99391  0.99392 
##        2        2        1        1        6       28        2        3 
##  0.99393  0.99394  0.99395  0.99396  0.99397  0.99398  0.99399    0.994 
##        1        4        1        4        4        5        1       37 
##  0.99402  0.99403  0.99404  0.99405  0.99406  0.99407  0.99408   0.9941 
##        4        1        2        3        6        1        4       25 
##  0.99411  0.99412  0.99413  0.99414  0.99415  0.99416  0.99418   0.9942 
##        3        2        1        2        1        3        3       38 
##  0.99422  0.99424  0.99425  0.99426  0.99427  0.99428  0.99429   0.9943 
##        3        2        4        2        1        5        3       11 
##  0.99432  0.99433  0.99434  0.99435  0.99436  0.99437  0.99438  0.99439 
##        6        1        4        2        1        3        7        1 
##   0.9944  0.99441  0.99442  0.99444  0.99445  0.99449   0.9945  0.99452 
##       46        2        2        3        6        4       22        4 
##  0.99453  0.99454  0.99455  0.99456  0.99457  0.99458  0.99459   0.9946 
##        1        7        5        4        1        5        2       32 
##  0.99461  0.99462  0.99463  0.99464  0.99466  0.99468  0.99469   0.9947 
##        2        3        1        1        2        2        4        9 
##  0.99471  0.99472  0.99473  0.99474  0.99475  0.99476  0.99477  0.99478 
##        6        3        2        6        4        1        1        4 
##  0.99479   0.9948  0.99481  0.99482  0.99485  0.99486  0.99488  0.99489 
##        4       45        2        3        1        3        4        3 
##   0.9949  0.99492  0.99494  0.99495  0.99496  0.99497  0.99498  0.99499 
##       20        2        6        2        4        2        2        1 
##    0.995  0.99502  0.99504  0.99505  0.99506  0.99507  0.99508  0.99509 
##       25        4        1        3        1        1        4        2 
##   0.9951  0.99511  0.99512  0.99513  0.99514  0.99516  0.99517  0.99518 
##       25        1        7        2        5        4        2        4 
##  0.99519   0.9952  0.99521  0.99522  0.99523  0.99524  0.99526  0.99527 
##        3       37        2        1        2        3        3        2 
##  0.99528   0.9953  0.99532  0.99534  0.99535  0.99536  0.99537  0.99538 
##        4       32        4        5        2        3        4        4 
##  0.99539   0.9954  0.99541  0.99542  0.99543  0.99544  0.99545  0.99546 
##        1       44        2        8        2        7        4        8 
##  0.99548   0.9955  0.99551  0.99552  0.99553  0.99554  0.99555  0.99556 
##        4       30        4        3        1        1        2        7 
##  0.99558   0.9956  0.99561  0.99562  0.99563  0.99564  0.99565  0.99566 
##        8       41        2        3        1        6        1        5 
##  0.99567  0.99568   0.9957  0.99571  0.99572  0.99573  0.99574  0.99576 
##        3        4       15        4        5        3        1        7 
##  0.99577  0.99578  0.99579   0.9958  0.99581  0.99582  0.99583  0.99584 
##        2        6        3       40        4        5        1        2 
##  0.99585  0.99586  0.99587  0.99588   0.9959  0.99591  0.99592  0.99594 
##        1        3        8        2       23        1        4        3 
##  0.99595  0.99596    0.996  0.99601  0.99602  0.99604  0.99605  0.99606 
##        1        5       20        1        2        7        2        2 
##  0.99608   0.9961  0.99611  0.99612  0.99615  0.99616   0.9962  0.99622 
##        1       16        1        4        1        2       31        6 
##  0.99624  0.99625  0.99626  0.99627  0.99628  0.99629   0.9963  0.99632 
##        1        1        3        2        6        1       18        2 
##  0.99634  0.99636   0.9964  0.99642  0.99644  0.99645  0.99646   0.9965 
##        1        3       18        7        2        1        1       18 
##  0.99652  0.99654  0.99655  0.99656  0.99657  0.99658  0.99659   0.9966 
##        4        3        3        1        4        2        3       36 
##  0.99662  0.99663  0.99665  0.99666  0.99668  0.99669   0.9967  0.99672 
##        2        2        3        9        1        1       13        3 
##  0.99674  0.99675  0.99676  0.99677  0.99678  0.99679   0.9968  0.99681 
##        1        2        5        1        5        1       20        1 
##  0.99683  0.99684  0.99685  0.99687  0.99688   0.9969  0.99691  0.99692 
##        1        3        2        2        2       15        5        5 
##  0.99693  0.99695  0.99696  0.99698  0.99699    0.997  0.99702  0.99704 
##        2        1        1        1        5       21        1        3 
##  0.99705  0.99706  0.99708  0.99709   0.9971  0.99711  0.99712  0.99713 
##        8        2        2        1       11        4        1        1 
##  0.99714  0.99715  0.99716  0.99718   0.9972  0.99724  0.99725  0.99726 
##        1        1        1        4       33        2        1        4 
##  0.99727  0.99728   0.9973  0.99732  0.99734  0.99736  0.99737  0.99738 
##        3        2        7        3        1        3        1        1 
##   0.9974  0.99741  0.99742  0.99745  0.99748   0.9975  0.99751  0.99752 
##       28        1        6        2        2       17        1        3 
##  0.99754  0.99755  0.99756  0.99758   0.9976  0.99767  0.99769   0.9977 
##        6        3        3        2       34        2        1       12 
##  0.99771  0.99772  0.99773  0.99775  0.99776  0.99778  0.99779   0.9978 
##        2        4        7        2        4        2        1       23 
##  0.99782  0.99784  0.99785  0.99786  0.99787  0.99788   0.9979  0.99792 
##        5        8        1        2        2        1       24       10 
##  0.99794  0.99795    0.998  0.99801  0.99802  0.99803  0.99804  0.99805 
##        4        2       35        1        2        2        2        2 
##  0.99806  0.99807  0.99808   0.9981  0.99814  0.99815   0.9982  0.99822 
##        1        8        8       15        1        5       28        3 
##  0.99824  0.99825  0.99827 0.998275  0.99828   0.9983  0.99831  0.99833 
##        1        8        2        1        2       21        1        1 
##  0.99834  0.99835  0.99836 0.998365  0.99837  0.99838  0.99839   0.9984 
##        4        4        2        1        1        2        3       29 
##  0.99841  0.99845  0.99848   0.9985  0.99851  0.99853  0.99855  0.99856 
##        1        1        3        4        2        1        7        1 
##  0.99858   0.9986  0.99862  0.99863  0.99864  0.99865  0.99869   0.9987 
##        1       42        5        2        1        2        2        9 
##  0.99872  0.99873   0.9988  0.99882  0.99884  0.99886  0.99888   0.9989 
##        2        1       10        2        8        2        3        5 
##  0.99896  0.99898  0.99899    0.999  0.99902  0.99904  0.99906  0.99907 
##        4        3        1       11        1        3        5        5 
##  0.99908   0.9991  0.99911  0.99916  0.99918   0.9992  0.99922  0.99924 
##        2        9        3        2        1        5        5        4 
##   0.9993  0.99935  0.99936  0.99938   0.9994  0.99941  0.99942  0.99943 
##        6        1        1        1        9        1        2        1 
##  0.99944  0.99945  0.99946  0.99947   0.9995  0.99954  0.99955  0.99956 
##        1        3        7        2        3        3        1        2 
##   0.9996  0.99965  0.99966   0.9997  0.99971  0.99975  0.99976   0.9998 
##        8        1        1        6        2        3        6       17 
##  0.99985   0.9999        1   1.0001  1.00013  1.00014  1.00016   1.0002 
##        1        9       19       11        2        2        1        7 
##  1.00022   1.0003  1.00037  1.00038   1.0004  1.00044  1.00047   1.0005 
##        1        3        2        2        9        2        1        2 
##  1.00051  1.00055   1.0006   1.0007   1.0008  1.00098    1.001   1.0011 
##        1        1        4        1        3        1        5        2 
##  1.00118   1.0012   1.0017  1.00182  1.00196   1.0024  1.00241  1.00295 
##        1        1        2        1        1        1        1        2 
##   1.0103  1.03898 
##        2        1

Choosing a bin width is difficult here, because the measurements are precise down to 0.00001.

ggplot(aes(x=density), 
       data=subset(wqw, density < quantile(density, .99))) +
  geom_histogram(binwidth=.00001) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-53

The data is very noisy. Let's increase the bin size.

ggplot(aes(x=density), 
       data=subset(wqw, density < quantile(density,.99))) + 
  geom_histogram(binwidth=.0001) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-54

The density distribution looks roughly normal, with three local peaks (trimodal).

pH

pH is an indicator of how acidic or basic the wine is.

ggplot(aes(x=pH), data=wqw) + geom_histogram() 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-55

No outliers here, but let's see if we can adjust the bin width.

Let's see the values.

table(wqw$pH)
## 
## 2.72 2.74 2.77 2.79  2.8 2.82 2.83 2.84 2.85 2.86 2.87 2.88 2.89  2.9 2.91 
##    1    1    1    3    3    1    4    1    9    9    9   11   17   31   15 
## 2.92 2.93 2.94 2.95 2.96 2.97 2.98 2.99    3 3.01 3.02 3.03 3.04 3.05 3.06 
##   18   38   35   26   63   32   41   68   74   49   68   78   97   89  115 
## 3.07 3.08 3.09  3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19  3.2 3.21 
##   79  136   92  135  126  134  117  172  136  164  124  138  145  137   95 
## 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29  3.3 3.31 3.32 3.33 3.34 3.35 3.36 
##  146  116  132  114   96   88   87   82   93   79   86   49   79   48   83 
## 3.37 3.38 3.39  3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49  3.5 3.51 
##   49   58   40   39   30   48   20   33   17   28   21   21   23   15   14 
## 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59  3.6 3.61 3.62 3.63 3.64 3.65 3.66 
##   17   13   14    9    8    5    5    6    7    3    1    6    2    4    5 
## 3.67 3.68 3.69  3.7 3.72 3.74 3.75 3.76 3.77 3.79  3.8 3.81 3.82 
##    1    2    2    1    3    2    2    2    2    1    2    1    1

There is a 0.01 precision on the measurements.
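One way to check this granularity without reading the whole table is to look at the smallest gap between distinct values. The sketch below runs on a handful of pH values copied from the table above, not on the full column, and the helper name smallest_step is made up for illustration.

```r
# Smallest gap between distinct values, used as a rough estimate of the
# measurement precision. `smallest_step` is a hypothetical helper name.
smallest_step <- function(x) {
  min(diff(sort(unique(x))))
}

# A few pH values taken from the table above (not the full column).
ph_sample <- c(2.72, 2.74, 2.77, 2.79, 2.80, 2.82, 2.83, 2.84, 3.00, 3.01)

step <- smallest_step(ph_sample)
round(step, 2)
```

On the real data the call would be smallest_step(wqw$pH), and the result can be passed to geom_histogram as binwidth.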

ggplot(aes(x=pH), data=wqw) + geom_histogram(binwidth=0.01) 

plot of chunk unnamed-chunk-57

The pH seems to follow a normal distribution.

summary(wqw$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The mean (3.188) and median (3.180) are nearly identical. All our white wines are acidic, with pH values between 2.72 and 3.82.

Sulphates (g/L)

Sulphates (here, potassium sulphate) are a wine additive used as an antimicrobial and antioxidant. Potassium sulphate can also be used as a fertilizer [http://www.solufeed.co.uk/solufeed-news/articles/2013/august/foliar-potassium-enhances-wine-quality.aspx].

ggplot(aes(x=sulphates), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-59

Let's look at the value granularity

table(wqw$sulphates)
## 
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 
##    1    1    4    4   13   13   16   31   35   54   59   84   85  120  129 
## 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##  214  151  168  139  181  161  216  178  225  172  179  166  249  140  156 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##  135  167  102  108   83   99   97   88   45   68   48   67   28   36   35 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   44   30   27   18   33   12   19   22   19   16   19   16    5    5   13 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99 
##    2    4    3    2    2    7    1    5    2    2    5    3    1    6    1 
##    1 1.01 1.06 1.08 
##    1    1    1    1

A bin width of 0.01 seems to fit the data.

ggplot(aes(x=sulphates), data=wqw) + geom_histogram(binwidth=0.01) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-61

The distribution is roughly normal, with two peaks. From the table, the peaks are at 0.38 (214 wines) and 0.5 (249 wines). We can also more clearly spot a few outliers above 1.0 g/L.

summary(wqw$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The middle 50% of the values lie between 0.41 g/L and 0.55 g/L (the first and third quartiles).

Alcohol (%)

Alcohol is fairly self-explanatory: it is expressed as a percentage by volume. 11.6% is considered a global average.

ggplot(aes(x=alcohol), data=wqw) + geom_histogram() 

plot of chunk unnamed-chunk-63

Let's try to adjust our bin size.

table(wqw$alcohol)
## 
##                8              8.4              8.5              8.6 
##                2                3                9               23 
##              8.7              8.8              8.9                9 
##               78              107               95              185 
##              9.1              9.2              9.3              9.4 
##              144              199              134              229 
##              9.5 9.53333333333333             9.55              9.6 
##              228                3                2              128 
## 9.63333333333333              9.7 9.73333333333333             9.75 
##                1              105                2                1 
##              9.8              9.9               10 10.0333333333333 
##              136              109              162                1 
##             10.1 10.1333333333333            10.15             10.2 
##              114                2                3              130 
##             10.3             10.4 10.4666666666667             10.5 
##               85              153                2              160 
## 10.5333333333333            10.55 10.5666666666667             10.6 
##                1                2                1              114 
##            10.65             10.7             10.8             10.9 
##                1               96              135               88 
## 10.9333333333333 10.9666666666667            10.98               11 
##                2                3                1              158 
##            11.05 11.0666666666667             11.1             11.2 
##                2                1               83              112 
## 11.2666666666667             11.3 11.3333333333333            11.35 
##                1              101                3                1 
## 11.3666666666667             11.4 11.4333333333333            11.45 
##                1              121                1                4 
## 11.4666666666667             11.5            11.55             11.6 
##                1               88                1               46 
## 11.6333333333333            11.65             11.7 11.7333333333333 
##                2                1               58                1 
##            11.75             11.8            11.85             11.9 
##                2               60                1               53 
##            11.94            11.95               12            12.05 
##                2                1              102                1 
## 12.0666666666667             12.1            12.15             12.2 
##                1               51                2               86 
##            12.25             12.3 12.3333333333333             12.4 
##                1               62                1               68 
##             12.5             12.6             12.7            12.75 
##               83               63               56                3 
##             12.8 12.8933333333333             12.9               13 
##               54                2               39               36 
##            13.05             13.1 13.1333333333333             13.2 
##                1               18                1               14 
##             13.3             13.4             13.5            13.55 
##                7               20               12                1 
##             13.6             13.7             13.8             13.9 
##                9                7                2                3 
##               14            14.05             14.2 
##                5                1                1

Most values are recorded with a precision of 0.1, although a few carry more decimals.
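To quantify how many values fall off the 0.1 grid (for example 9.53333333333333 in the table above), a rough check is sketched below. It uses a few alcohol values copied from the table; off_grid_share is a hypothetical helper, and the tolerance of 1e-8 is an arbitrary choice to absorb floating-point error.

```r
# Share of values that are not a multiple of `step`.
# `off_grid_share` is a hypothetical helper name.
off_grid_share <- function(x, step = 0.1, tol = 1e-8) {
  scaled <- x / step
  mean(abs(scaled - round(scaled)) > tol)
}

# A few alcohol values taken from the table above (not the full column).
alcohol_sample <- c(8, 8.4, 9.5, 9.53333333333333, 9.55, 9.6,
                    10, 10.0333333333333)

share <- off_grid_share(alcohol_sample)
share
```

On the real data, off_grid_share(wqw$alcohol) would give the proportion of wines whose alcohol value is not a multiple of 0.1.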

ggplot(aes(x=alcohol), data=wqw) + geom_histogram(binwidth=0.1) 
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-65

Univariate Analysis

What is the structure of the dataset?

There are 4,898 white wines in the dataset with 13 variables:

Main observations:

What is/are the main feature(s) of interest in your dataset?

The most important feature is the quality. For the rest of the features, it is not easy at this stage to identify which ones really matter. A good wine is a well-balanced composition that does not seem to depend on one particular chemical component.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is still difficult to identify which features will help, but the density, the alcohol, the sulfur dioxide and the volatile acidity (the vinegar taste) might be the most helpful.

Did you create any new variables from existing variables in the dataset?

I created 3 categorical variables and 1 continuous variable.

The first categorical is sweetness. The residual.sugar has been used to categorize the wines.

The second categorical is contains.sulfites. It is more of a regulatory label than a taste category, but it could be interesting.

The third categorical is added.citric.acid, a boolean that marks the wines with an unusual concentration of citric acid.

The continuous variable is ratio.sulfur.dioxide, the ratio of free.sulfur.dioxide to total.sulfur.dioxide.
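As a minimal sketch of the ratio variable, the computation is a column-wise division. The three rows below are made-up values for illustration, not rows from the dataset; only the column names come from it.

```r
# Toy data frame with made-up values (mg/dm^3); only the column names
# match the wine dataset.
toy <- data.frame(
  free.sulfur.dioxide  = c(40, 20, 30),
  total.sulfur.dioxide = c(160, 100, 120)
)

# Ratio of free to total sulfur dioxide, between 0 and 1 by construction
# as long as free <= total.
toy$ratio.sulfur.dioxide <- toy$free.sulfur.dioxide / toy$total.sulfur.dioxide
toy
```

The same assignment applied to wqw would add the column used in the later plots.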

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The residual sugar had a long-tailed distribution. After a log10 transformation it looked like a bimodal distribution. I did not change the values, but I will keep this property of the distribution in mind.

Bivariate Plots Section

Scatter Plot Matrix

set.seed(231)
# Draw 2000 row ids at random; X is the row id column of the CSV
sample.ids <- sample(wqw$X, 2000)
# Columns 2:18 skip the X id column
ggpairs(subset(wqw, X %in% sample.ids)[,2:18])

plot of chunk unnamed-chunk-66

Density vs Residual Sugar

According to the matrix, density and residual.sugar have a strong correlation of 0.83. Let's visualize it in a scatter plot.

ggplot(aes(x=density, y=residual.sugar), 
       data=wqw) + 
  geom_point(alpha=.2)

plot of chunk unnamed-chunk-67

It looks like a linear relationship.

ggplot(aes(x=density, y=residual.sugar), 
       data=subset(wqw, residual.sugar<30)) + 
  geom_point(alpha=.5) +
  geom_smooth(method="lm")

plot of chunk unnamed-chunk-68

We have a strong relationship, and it makes sense: the more sugar is dissolved in a liquid, the higher the density of that liquid.
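To put a number on such a linear trend, one option is to fit a simple linear model and read off the slope and R². The sketch below runs on synthetic data generated for illustration (the slope of 2000 and the noise level are arbitrary, not estimates from the wine data). On the real data the call would be lm(residual.sugar ~ density, data = wqw).

```r
# Synthetic data: residual sugar rises linearly with density, plus noise.
# All numbers here are arbitrary and only serve the illustration.
set.seed(42)
n <- 200
density <- runif(n, 0.990, 1.000)
residual.sugar <- 2000 * (density - 0.990) + rnorm(n, sd = 1)
toy <- data.frame(density, residual.sugar)

# Same model that geom_smooth(method = "lm") draws for y ~ x.
fit <- lm(residual.sugar ~ density, data = toy)

slope <- unname(coef(fit)["density"])
r2 <- summary(fit)$r.squared
r <- cor(toy$density, toy$residual.sugar)

c(slope = slope, r.squared = r2, pearson.r = r)
```

For a regression with a single predictor, R² equals the square of the Pearson correlation, which ties the fitted line back to the coefficient read from the scatter plot matrix.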

Alcohol vs Density

A second strong correlation, -0.78, is between the alcohol and the density. Let's create a scatter plot to explore this relationship.

ggplot(aes(x=alcohol, y=density), 
      data=subset(wqw, density < 1.01)) +
  geom_jitter(alpha=.2)

plot of chunk unnamed-chunk-69

ggplot(aes(x=alcohol, y=density), 
      data=subset(wqw, density < 1.01)) +
  geom_jitter(alpha=.2) +
  geom_smooth(method="lm")

plot of chunk unnamed-chunk-70

The alcohol and density seem to follow a linear relationship. This makes sense, as the density of alcohol is lower than that of water (which is about 1). The higher the alcohol concentration, the lower the density.

Total Sulfur Dioxide vs Density

A third, moderate correlation of 0.53 exists between the total sulfur dioxide and the density.

ggplot(aes(x=total.sulfur.dioxide, y=density), 
      data=subset(wqw, density < 1.01)) +
  geom_jitter(alpha=.2)

plot of chunk unnamed-chunk-71

The scatter plot is not very convincing. It looks like a weak relationship.

Total Sulfur Dioxide vs Residual Sugar

Between the total sulfur dioxide and the residual sugar, there is a moderate correlation coefficient of 0.47. Let's have a closer look.

ggplot(aes(x=total.sulfur.dioxide, y=log10(residual.sugar)), 
      data=subset(wqw, residual.sugar < quantile(residual.sugar, .99))) +
  geom_jitter(alpha=.2)

plot of chunk unnamed-chunk-72

It is hard to get any information from this graph. An additional variable might be useful here.

Alcohol vs Quality

A moderate positive correlation of 0.43 was spotted in the matrix between the quality and the level of alcohol.

ggplot(aes(x=as.factor(quality), y=alcohol), data=wqw) + 
  geom_boxplot()

plot of chunk unnamed-chunk-73

It looks like the good wines in our sample have more alcohol. On average, higher-quality wines contain more alcohol than average-quality wines. Note that the average-quality wines have a lower alcohol level than the worst-quality wines.
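The numbers behind such a box plot can be pulled out directly by computing the median per quality level. The data frame below is a small made-up example, so its values say nothing about the real wines; only the column names match the dataset.

```r
# Made-up toy data: three wines for each of three quality levels.
toy <- data.frame(
  quality = c(3, 3, 3, 5, 5, 5, 8, 8, 8),
  alcohol = c(10.2, 10.6, 10.4, 9.4, 9.8, 9.6, 11.5, 12.1, 11.8)
)

# Median alcohol per quality level: the middle line of each box.
med <- tapply(toy$alcohol, toy$quality, median)
med
```

On the real data, tapply(wqw$alcohol, wqw$quality, median) would return one median per quality grade, which makes the comparison between grades explicit.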

pH vs Fixed Acidity

Another moderate negative correlation, -0.45, exists between the pH and the fixed acidity.

ggplot(aes(x=pH, y=fixed.acidity), 
       data=wqw) + 
  geom_jitter(alpha=.2) +
  geom_smooth(method='lm')

plot of chunk unnamed-chunk-74

We clearly see that the higher the fixed acidity, the lower the pH. This makes sense, as a low pH means a more acidic wine.

Total Sulfur Dioxide vs Free Sulfur Dioxide

The total and the free sulfur dioxide have a correlation coefficient of 0.61. Let's investigate further.

ggplot(aes(x=total.sulfur.dioxide, y=free.sulfur.dioxide), data=wqw) + 
  geom_point(alpha=.2) +
  coord_cartesian(xlim=c(0,300)) +
  geom_smooth(method='lm')

plot of chunk unnamed-chunk-75

The relationship looks linear. In a way this makes sense, as free sulfur dioxide is part of the total sulfur dioxide. Let's now plot the relationship between (total - free) and free.

ggplot(aes(x=total.sulfur.dioxide - free.sulfur.dioxide, y=free.sulfur.dioxide), data=wqw) + 
  geom_point(alpha=.2) +
  coord_cartesian(xlim=c(0,300)) +
  geom_smooth(method='lm')

plot of chunk unnamed-chunk-76

This is not very conclusive: the points suggest a rather weak correlation.

cor.test(x = wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide,
         y = wqw$free.sulfur.dioxide)
## 
##  Pearson's product-moment correlation
## 
## data:  wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide and wqw$free.sulfur.dioxide
## t = 19.1158, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2372821 0.2894077
## sample estimates:
##       cor 
## 0.2635373

Only a weak 0.26 correlation coefficient.

Ratio Sulfur Dioxide vs pH

I read that the sulfur dioxide ratio influences the pH. Let's check whether we see anything.

ggplot(aes(x=pH, y=ratio.sulfur.dioxide), data=wqw) + 
  geom_point(alpha=.2)

plot of chunk unnamed-chunk-78

It looks like a correlation of about 0. The two are definitely not related.

pH vs Quality

Let's compare pH across quality levels.

ggplot(aes(x=as.factor(quality), y=pH), data=wqw) + 
   geom_boxplot()

plot of chunk unnamed-chunk-79

The very best wines have a tightly controlled, narrow pH range. In contrast, the worst wines have a more spread-out and lower (more acidic) pH. There are many more outliers for average-quality wines (5 and 6), but those make up the vast majority of our sample. Quality 5 has the lowest average pH.

Chlorides vs Quality

Chloride (or salt) is a great taste enhancer. Let's see its relationship with quality.

ggplot(aes(x=as.factor(quality), y=chlorides), data=wqw) + 
   geom_boxplot()

plot of chunk unnamed-chunk-80

At this scale, the best wines do not obviously have a lower chloride level. Again, we can spot many outliers for qualities 5 and 6. Let's zoom in to get more details.

ggplot(aes(x=as.factor(quality), y=chlorides), data=wqw) + 
  geom_boxplot() +
  coord_cartesian(ylim=c(.01, .075))

plot of chunk unnamed-chunk-81

The better the wine, the lower the chloride level, except for the worst wines graded 3 and 4. Are they not even worth a bit of chloride?

Volatile Acidity vs Quality

Too much volatile acidity is supposed to give the wine a vinegar smell. Let's see if the worst wines are the ones with a vinegar smell.

ggplot(aes(x=as.factor(quality), y=volatile.acidity), data=wqw) + 
  geom_boxplot() 

plot of chunk unnamed-chunk-82

Actually, the worst wines (quality 3) don't have the highest level of volatile acidity. However, the wines of quality 4 have the highest average concentration and a few high outliers.

Density vs Quality

Let's compare density in different quality groups

ggplot(aes(x=as.factor(quality), y=density), data=wqw) + 
  geom_boxplot() +
  coord_cartesian(ylim = c(.985, 1))  

plot of chunk unnamed-chunk-83

The best wines (quality 7, 8 and 9) have on average a lower density.

ggplot(aes(x=quality, y=density), data=wqw) + 
  geom_jitter(alpha=.1) +
  coord_cartesian(ylim = c(.985, 1))  

plot of chunk unnamed-chunk-84

Total Sulfur Dioxide vs Quality

Let's compare total sulfur dioxide according to quality groups

ggplot(aes(x=as.factor(quality), y=total.sulfur.dioxide), data=wqw) + 
   geom_boxplot() 

plot of chunk unnamed-chunk-85

This is an interesting plot: the better the quality, the narrower the variation of total sulfur dioxide. It is as if the best wine producers have more control over the sulfur dioxide and do not let it vary much.
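The narrowing can be expressed as a number by computing the interquartile range (the height of each box) per quality level. The sketch below uses made-up values chosen so that the higher grade has the tighter spread; it illustrates the computation only and is not a result from the wine data.

```r
# Made-up toy data: a wide spread for quality 4, a narrow one for quality 8.
toy <- data.frame(
  quality = c(4, 4, 4, 4, 8, 8, 8, 8),
  total.sulfur.dioxide = c(60, 110, 160, 210, 115, 125, 135, 145)
)

# Interquartile range per quality level (box height in the box plot).
spread <- tapply(toy$total.sulfur.dioxide, toy$quality, IQR)
spread
```

On the real data, tapply(wqw$total.sulfur.dioxide, wqw$quality, IQR) would show whether the spread really shrinks as the grade goes up.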

Citric Acid (and non-normal citric acid) vs Quality

Let's see how the wines with those unusual levels of citric acid are rated in quality.

ggplot(aes(x=as.factor(quality), y=citric.acid), data=wqw) + 
  geom_jitter(aes(color=added.citric.acid), alpha=.2) +
  coord_cartesian(ylim=c(0,1))

plot of chunk unnamed-chunk-86

ggplot(aes(x=citric.acid), data=wqw) + 
  geom_histogram(binwidth=.01) +
  facet_wrap(~ quality, scales='free')

plot of chunk unnamed-chunk-87

Well, the non-normal concentrations show up for qualities 4, 5, 6 and 7. For 3, 8 and 9 you cannot spot a peak at 0.49 or 0.74.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

On the one hand, two features have a positive effect on density: residual sugar and total sulfur dioxide. On the other hand, alcohol has a negative effect on density.

The best white wines have a low density and high alcohol. Therefore a wine producer should maximise the fermentation so that most of the residual sugar is consumed and as much alcohol as possible is produced.

The free and total sulfur dioxide were correlated because the latter includes the former. The difference between total and free sulfur dioxide is called bound sulfur dioxide [http://www.practicalwinery.com/janfeb09/page2.htm]. In our sample the bound and free sulfur dioxide have only a weak correlation (coefficient 0.26).
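
The bound sulfur dioxide is simply total minus free, so it can be derived in one line and then correlated with the free part. This is a minimal sketch: add_bound_so2 and the column name bound.sulfur.dioxide are hypothetical names chosen for illustration, and the toy values are made up, so the coefficient printed here is not the 0.26 found on the real data.

```r
# Hypothetical helper: add bound SO2 (total minus free) as a new column.
add_bound_so2 <- function(df) {
  df$bound.sulfur.dioxide <- df$total.sulfur.dioxide - df$free.sulfur.dioxide
  df
}

# Toy data, made up for illustration only.
toy <- data.frame(free.sulfur.dioxide  = c(10, 20, 30, 40),
                  total.sulfur.dioxide = c(100, 90, 150, 130))

toy <- add_bound_so2(toy)

# Pearson correlation between the free and the bound part.
r <- cor(toy$free.sulfur.dioxide, toy$bound.sulfur.dioxide)
print(r)
```

On the real data the same two steps would be add_bound_so2(wqw) followed by cor() on the two columns.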

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The way total sulfur dioxide and pH vary with quality seems to tell the story that the producers who make better wine are more in control of the sulfur dioxide and the pH.

What was the strongest relationship you found?

The strongest relationship was between density and residual sugar: density is strongly positively correlated with residual sugar. There is also a strong negative correlation between alcohol and density.
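
These relationships can be read off a small correlation matrix restricted to the three variables. This is a sketch only: density_correlations is a hypothetical helper, and the toy values are made up so that the signs are easy to see. They are not values from the dataset.

```r
# Hypothetical helper: Pearson correlation matrix for the three variables.
density_correlations <- function(df) {
  cor(df[, c("density", "residual.sugar", "alcohol")])
}

# Toy data, made up for illustration only:
# density rises with sugar and falls with alcohol.
toy <- data.frame(density        = c(0.990, 0.992, 0.995, 0.998),
                  residual.sugar = c(1.2, 3.0, 8.0, 14.0),
                  alcohol        = c(12.5, 11.8, 10.2, 9.0))

cm <- density_correlations(toy)
print(round(cm, 2))
```

On the real data, density_correlations(wqw) would give the coefficients behind the statements above.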

Multivariate Plots

Exploring Density vs Residual Sugar vs Alcohol

The bivariate plots showed a relationship between density, residual.sugar and alcohol. Let's get a better feeling for it.

ggplot(aes(x=density, y=residual.sugar, color=alcohol), 
       data=subset(wqw, density < quantile(density, .99))) + 
  geom_jitter() +
  scale_y_continuous(trans=log10_trans())

plot of chunk unnamed-chunk-88

We can clearly see that, for a given residual sugar, the density decreases as the alcohol increases. When the residual sugar increases, the alcohol is lower.

ggplot(aes(x=density, y=residual.sugar, color=alcohol), 
       data=subset(wqw, density < quantile(density, .99))) + 
  geom_jitter() +
  scale_y_continuous(trans=log10_trans()) +
  facet_wrap(~ quality)

plot of chunk unnamed-chunk-89

The better wines (7 to 9) have on average a lower residual sugar and a higher alcohol concentration. The worst wines (3 to 5) don't have a lot of alcohol. The average wines (6) show both characteristics.

p1 <- ggplot(aes(x=density, y=residual.sugar, color=alcohol), 
       data=subset(wqw, density < quantile(density, .99))) + 
  geom_jitter() +
  scale_y_continuous(trans=log10_trans()) +
  ggtitle("all wines")
p2 <- ggplot(aes(x=density, y=residual.sugar, color=alcohol), 
       data=subset(wqw, density < quantile(density, .99) & quality == 6)) + 
  geom_jitter() +
  scale_y_continuous(trans=log10_trans()) +
  ggtitle("quality 6 wines")
grid.arrange(p1,p2)

plot of chunk unnamed-chunk-90

The average wines (6) are a good subset to represent those 2 characteristics.

Let's have a look again at the total.sulfur.dioxide vs residual.sugar. Maybe by adding quality as color it would help us identify a pattern.

ggplot(aes(x=total.sulfur.dioxide, y=log10(residual.sugar), color=quality), 
      data=subset(wqw, residual.sugar < quantile(residual.sugar, .99))) +
  geom_jitter()

plot of chunk unnamed-chunk-91

Well, not really helpful.

Exploring Sweetness

Continuing from the previous graphs, let's see how our sweetness variable can be used.

ggplot(aes(x=density, y=alcohol, color=sweetness), data=wqw) + 
   geom_jitter(alpha=.8)

plot of chunk unnamed-chunk-92

I like this plot as it connects to my past experience with different wine sweetness levels.

Explore Total Sulfur Dioxide

ggplot(aes(x=density, y=total.sulfur.dioxide, color=alcohol), 
       data=subset(wqw, 
                   density < quantile(density, .99) &
                   total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) + 
  geom_jitter() 

plot of chunk unnamed-chunk-93

What we mostly see in the previous plot is a relationship between density and alcohol.

ggplot(aes(x=density, y=total.sulfur.dioxide, color=alcohol), 
       data=subset(wqw, 
                   density < quantile(density, .99) &
                   total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) + 
  geom_jitter() +
  facet_wrap(~ quality)

plot of chunk unnamed-chunk-94

Well, those last two plots are not really helping our exploration. Let's drop the sulfur dioxide and look from the angle of the contains.sulfites variable.

Explore Contains Sulfites

ggplot(aes(x=contains.sulfites, y=pH, color=as.factor(quality)), 
       data=subset(wqw, fixed.acidity < quantile(fixed.acidity,.99))) + 
  geom_jitter(alpha=.5) 

plot of chunk unnamed-chunk-95

ggplot(aes(x=contains.sulfites, y=alcohol, color=as.factor(quality)), 
       data=subset(wqw, fixed.acidity < quantile(fixed.acidity,.99))) + 
  geom_jitter(alpha=.2) 

plot of chunk unnamed-chunk-96

Well, the only insight I get from this graph is that the low-sulfite wines seem to be on average of better quality. Let's go back to a simple boxplot.

ggplot(aes(x=as.factor(quality), y=total.sulfur.dioxide), 
       data=wqw) + 
  geom_boxplot() 

plot of chunk unnamed-chunk-97

Back to square one with the understanding and visualisation of the sulfur dioxide. I'm a bit clueless. Let's try with pH.

ggplot(aes(x=pH, y=total.sulfur.dioxide, color=as.factor(quality)), 
       data=subset(wqw, density < quantile(density,.99))) + 
  geom_jitter(alpha=.5) 

plot of chunk unnamed-chunk-98

Seems like another dead end.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I was looking at “Which chemical properties influence the quality of white wines?”.

From my exploratory data analysis, it appears that the pH, the residual sugar, the density, the chlorides and the alcohol can help us identify a good wine. The lower the residual sugar, the chlorides and the density, and the higher the pH and the alcohol, the better the wine.

The alcohol concentration is a good approximation of the quality of the wine as it illustrates that the fermentation process was well done and very little residual sugar is left in the bottle.

However, a good wine appears to be the right balance of many chemical properties, which prevented me from identifying a linear model.

Were there any interesting or surprising interactions between features?

Regarding citric acid, it seems to be an additive commonly used across all qualities of wine. I would have expected good quality wines not to rely on such an additive. Also, I still need to find an official confirmation, but the European Union might not allow this additive.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No, I failed to identify or transform my variables to support a linear model.

Final Plots and Summary

Plot One

ggplot(aes(x=density, y=alcohol, color=sweetness), 
       data=wqw) + 
  geom_jitter(alpha=.8) +
  xlab("Density (g/cm^3)") +
  ylab("Alcohol (%)") + 
  scale_color_discrete(name="Sweetness") +
  geom_vline(xintercept=1, linetype="dotted") +
  ggtitle("Wine Alcohol vs Density by Sweetness")

plot of chunk unnamed-chunk-99

Description One

This would be a nice plot for a wider audience. It lets us compare how the different wine sweetness levels relate to alcohol and density. The dry wines can go quite high in alcohol percentage, and those dry wines would feel lighter. By contrast, a medium white wine still contains quite a lot of residual sugar and would feel heavier in the mouth, closer to water (density of 1).

Plot Two

wqw$quality <- factor(wqw$quality, levels= rev(levels(as.factor(wqw$quality))))
ggplot(aes(x=citric.acid, fill=as.factor(quality), order=as.numeric(quality)), 
       data=subset(wqw, citric.acid < quantile(citric.acid,.999))) + 
  geom_histogram(binwidth=.01) +
  ylab("Number of wines") +
  xlab("Citric Acid (g/L)") +
  scale_fill_discrete(name="Quality") +
  ggtitle("Wine Bottles' concentration in Citric Acid")
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

plot of chunk unnamed-chunk-100

Description Two

This second plot shows the peaks in citric acid concentration at 0.49 g/L and 0.74 g/L. I was quite shocked to see that even the better wine producers in Portugal were using this technique, even though the European Union might not allow it.

Plot Three

ggplot(aes(x=density, y=residual.sugar, color=alcohol), 
       data=subset(wqw, density < quantile(density, .99))) + 
  geom_jitter() +
  scale_y_continuous(trans=log10_trans()) +
  xlab("Density (g/cm^3)") + 
  ylab("Residual Sugar (g/L)") +
  scale_color_continuous(name="Alcohol (%)") +
  ggtitle("Residual Sugar vs Density vs Alcohol")

plot of chunk unnamed-chunk-101

Description Three

This third plot illustrates the balance between sugar and alcohol in setting the density of the wine. The bottles with high sugar have a low percentage of alcohol. A more complete fermentation process lowers the residual sugar, increases the alcohol and, as a result, lowers the density. The shape of the plot is logarithmic.

Reflection

The exercises in Udacity's lesson 3 were much easier than figuring out a direction without guidance for this project. One has to go step by step. Even with a reasonable number of variables (around 17 here) it was very difficult for me not to get lost. I had to move back and forth in this report to correct wrong conclusions or to move plots from the univariate section to the bivariate or trivariate section.

Another big source of struggle was matching the dataset's variables with other information that I could find online. The names sulfates, sulfur and sulfite were a great source of confusion. To add to the naming confusion, some online searches gave very different averages, for example for ratio.sulfure.dioxide. After reading multiple sources online and coming back to the dataset description, I slowly learnt the different components, but I'm still a bit puzzled by the difference in averages.

The small success was rediscovering something that I already knew (the relationship between sugar, density and alcohol), but the feeling of success mostly came when I selected the right graph for my purpose. I easily got stuck in the analysis with certain types of graph. For example, I couldn't find a way out with scatter plots and histograms until I got the idea of using boxplots, which made a lot of relationships clearer. I also liked adding sweetness as a variable, which helped me connect with the subject.

The next steps for further analysis would be to